| average_customer | |
|---|---|
| Year_Birth | 1968.917 |
| Income | 52236.58 |
| Kidhome | 0.441934 |
| Teenhome | 0.5056484 |
| Dt_Customer | 2013-07-10 |
| Recency | 49.00768 |
| MntWines | 305.1536 |
| MntFruits | 26.32399 |
| MntMeatProducts | 166.9625 |
| MntFishProducts | 37.63534 |
| MntSweetProducts | 27.03479 |
| MntGoldProds | 43.91143 |
| NumDealsPurchases | 2.32535 |
| NumWebPurchases | 4.087664 |
| NumCatalogPurchases | 2.671487 |
| NumStorePurchases | 5.805242 |
| NumWebVisitsMonth | 5.321735 |
| AcceptedCmp3 | 0.07365567 |
| AcceptedCmp4 | 0.07410755 |
| AcceptedCmp5 | 0.07275192 |
| AcceptedCmp1 | 0.06416629 |
| AcceptedCmp2 | 0.01355626 |
| Complain | 0.009037506 |
| Z_CostContact | 3 |
| Z_Revenue | 11 |
| Response | 0.1504745 |
| Age | 52.08269 |
| education | Graduation |
| status_marital | Married |
| Cluster 1 - Low Value Customers | Cluster 2 - High Value Customers |
|---|---|
|
|
Among the most important variables for the proposed machine learning model, the number of days since the last purchase (‘Recency’) is very important.
Because purchasing in store, on the web, or via the catalog (‘NumStorePurchases’, ‘NumWebPurchases’, ‘NumCatalogPurchases’) is positively correlated with ‘Income’. Eventually, these variables become significant.
The history of past campaigns are also crucial factors to this prediction.
Nevertheless, variables such as ‘Complain’,‘Age’ of the customer or ‘Marital_Status’ are not so crucial for the model.
---
title: 'R Notebook: iFood CRM Data Analyst Case'
author: "Eduar Felipe Riaño Torres^[felipehuman@gmail.com]"
date: "`r Sys.Date()`"
output:
flexdashboard::flex_dashboard:
#orientation: rows
social: menu
source_code: embed
theme: journal
---
```{r setup, include=FALSE}
# summarization
library(skimr)
library(Hmisc)
library(knitr) # some tables and R Markdown
library(correlationfunnel) # correlation Analysis
# articulation with Python
library(reticulate)
# general visualization
library(ggplot2) # visualization
library(scales) # visualization
library(grid) # visualization
library(gridExtra) # visualization
library(RColorBrewer) # visualization
library(corrplot) # visualization
library(reshape2) # visualization
library(hrbrthemes) # visualization
# general data manipulation
library(dplyr) # data manipulation
library(readr) # input/output
library(data.table) # data manipulation
library(tibble) # data wrangling
library(tidyr) # data wrangling
library(stringr) # string manipulation
library(forcats) # factor manipulation
library(tidyverse) # plotting, cleaning, etc
library(gdata)
library(plyr)
# specific visualization
library(alluvial) # visualization
library(ggrepel) # visualization
library(ggforce) # visualization
library(ggridges) # visualization
library(gganimate) # animations
library(gridExtra) # visualization
library(GGally) # visualization
library(ggExtra) # visualization
library(highcharter) # visualization
library(countrycode) # visualization
library(geofacet) # visualization
library(wesanderson) # color palettes
library(treemapify) # visualization
library(cluster) # visualization
library(gridExtra) # visualization
library(grid) # visualization
library(plotly)
library(magick)
# specific data manipulation
library(lazyeval) # data wrangling
library(broom) # data wrangling
library(purrr) # string manipulation
library(reshape2) # data wrangling
library(rlang) # encoding
# maps / geospatial
library(geosphere) # geospatial locations
library(leaflet) # maps
library(leaflet.extras) # maps
library(maps) # maps
# text / NLP
library(tidytext) # text analysis
library(tm) # text analysis
library(SnowballC) # text analysis
library(topicmodels) # text analysis
library(wordcloud) # test visualization
# analysis
library(lubridate)
library(tidyverse)
library(caret)
library(xgboost)
library(modeest)
library(NbClust)
library(factoextra)
library(tidymodels) # framework for ML
library(ranger)
library(randomForest)
library(vip)
library(DataExplorer)
library(tidyquant)
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard)
library(knitr)
library(heatmaply)
library(corrplot)
# Make some noisily increasing data
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each=10),
xvar = 1:20 + rnorm(20,sd=3),
yvar = 1:20 + rnorm(20,sd=3))
setwd("C:/Users/eduar/Downloads/My Things/Data sets R/iFood")
raw_ifood <- read.csv("ml_project1_data.csv")
raw_ifood <- raw_ifood %>% filter(.,!is.na(Income))
raw_ifood$Income <- as.numeric(raw_ifood$Income %>% gsub("[$,]","",.))
current_year = 2021
raw_ifood <- mutate(raw_ifood, Age = current_year - raw_ifood$Year_Birth)
raw_ifood <- raw_ifood %>%
filter(.,Age<100)
raw_ifood$Dt_Customer <- ymd(raw_ifood$Dt_Customer)
data_numeric <- raw_ifood %>% select_if(., is.numeric) %>% select(-c("AcceptedCmp1","AcceptedCmp2",
"AcceptedCmp3","AcceptedCmp4",
"AcceptedCmp5","Recency",
"Complain","ID","Z_CostContact",
"Z_Revenue","Year_Birth"))
# Discrete variables
people_discrete <- raw_ifood %>% select(c('Education','Marital_Status',
'Kidhome','Teenhome','Complain'))
# Continuous variables
people_continuous <- raw_ifood %>% select(c('Year_Birth','Income',
'Dt_Customer','Recency'))
data_products <- raw_ifood %>% select(c('MntWines','MntFruits',
'MntMeatProducts',
'MntFishProducts',
'MntSweetProducts',
'MntGoldProds'))
data_place <- raw_ifood %>% select(c('NumWebPurchases','NumCatalogPurchases',
'NumStorePurchases','NumWebVisitsMonth'))
data_promotion <- raw_ifood %>% select(c('AcceptedCmp1','AcceptedCmp2','AcceptedCmp3',
'AcceptedCmp4',
'AcceptedCmp5',
'Response',
'NumDealsPurchases'))
product_key_value <- gather(data_products)
place_key_value <- gather(data_place)
people_key_value_disc <- gather(people_discrete)
people_key_value_cont <- gather(people_continuous)
promotion_key_value <- gather(data_promotion)
# Counting ones
campaign_takeup <- raw_ifood %>%
select('AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
'AcceptedCmp4','AcceptedCmp5', 'Response') %>%
colSums()
# Counting zeros
zero <- function(x) sum(x == 0)
campaign <- raw_ifood %>%
select('AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
'AcceptedCmp4','AcceptedCmp5', 'Response')
rechazo <- numcolwise(zero)(campaign)
rechazo <- as.data.frame(t(as.matrix(rechazo)))
colnames(rechazo) <- "no_aceptacion"
# Making a data frame with Acceptance and Rejected campaigns
campaign_takeup <- data.frame(campana = c('AcceptedCmp1', 'AcceptedCmp2',
'AcceptedCmp3',
'AcceptedCmp4',
'AcceptedCmp5',
'Response'),
aceptacion = campaign_takeup[1:6],
no_aceptacion = rechazo)
ifood_df_clustering = raw_ifood %>%
select(-ID,-Education,-Marital_Status,-Dt_Customer,
-Year_Birth,-Z_CostContact,-Z_Revenue,-Response)
k2 <- kmeans(scale(ifood_df_clustering), 2, iter.max = 100,
nstart = 50, algorithm = "Lloyd")
```
Customer Profile Analysis
=======================================================================
Row
-------------------------------------
### Product Feature
```{r}
# Plotting bar plots for product variables
p1 <- ggplot(product_key_value, aes(value)) +
geom_histogram(alpha=.3, colour = "brown3", fill = "firebrick1", notch = TRUE) +
facet_wrap(~key, scales = 'free_x', ncol = 3) +
labs(title = "Boxplot of Product Variables") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p1)
```
### Promotion Feature
```{r}
# Plotting bar-plots for promotion variables
p2 <- ggplot(promotion_key_value, aes(value)) +
geom_histogram(stat = 'count', colour = "brown3", fill = "firebrick1",
position="identity",alpha=.3) +
facet_wrap(~key, scales = 'free_x', ncol = 3) +
labs(title = "Bar Plots of Promotion Variables") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p2)
```
Row
-------------------------------------
### Marketing Feature
```{r}
campaign_takeup_long <- melt(campaign_takeup)
p3 <- ggplot(campaign_takeup_long, aes(x = campana, y = value,
fill = variable))+
geom_bar(stat = "identity", position= "dodge",
fill = c("#D20000"), alpha=.50,show.legend = NA) +
theme_bw() +
labs(x = "Campaign", y = "Acceptance",
title="Acceptance of Marketing Campaigns") +
coord_flip() +
scale_y_continuous(labels = scales::comma)
ggplotly(p3)
```
### General Analysis
```{r}
average_customer_num <- raw_ifood %>%
select_if(names(.)=="Dt_Customer" | sapply(., is.numeric)) %>%
select(-ID) %>%
summarise_each(funs(mean)) %>%
t() %>%
as.data.frame() %>%
format(scientific = F, digits = 2) %>%
setnames("V1", "average_customer")
education <- mlv(raw_ifood$Education, method="mfv")
status_marital <- mlv(raw_ifood$Marital_Status, method="mfv")
categ <- data.frame(education,status_marital)
average_customer_categ <- categ %>%
t() %>%
as.data.frame() %>%
setnames("V1", "average_customer")
average_customer <- rbind(average_customer_num, average_customer_categ)
knitr::kable(average_customer)
```
Correlation Analysis
=======================================================================
Column {data-width=550}
-------------------------------------
### Heat Map
```{r}
colores <- colorRampPalette(c("dodgerblue", "ghostwhite", "firebrick2"))(20)
crr <- cor(data_numeric, use="complete.obs")
heatmaply_cor(crr,
xlab = "Features",
ylab = "Features",
k_col = 2,
k_row = 2
)
```
Column {data-width=450}
-------------------------------------
### Correlation Plot
```{r, fig.height=6}
corrplot(corr = crr,
method="number",
col = colores,
type="upper",
tl.col="black",
order="hclust",
number.cex = 0.71)
```
### Correlation Funnel
```{r}
customer_ifoof_binarized <- raw_ifood %>%
select(-ID,-Year_Birth,-Z_CostContact,-Z_Revenue, -Dt_Customer) %>%
binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)
customer_response_corr <- customer_ifoof_binarized %>%
correlate(Response__1)
customer_response_corr %>%
plot_correlation_funnel()
```
Customer segmentation
=======================================================================
Row
-----------------------------------------------------------------------
### Segmentation by Response
```{r}
p5 <- raw_ifood %>%
ggplot(aes(x = Age, y = Income, color = Teenhome, size = Recency)) +
geom_point(alpha = 0.25) +
labs(title = "Customers response according to their age and income") +
facet_wrap(~Response) +
scale_y_continuous(labels = scales::comma)
ggplotly(p5)
```
### Education group by Marital Status
```{r}
graph_3 <- ggplot(raw_ifood, aes(x = Education, fill= Marital_Status)) +
geom_bar(position = position_fill(), alpha=.60) +
scale_fill_manual(values = c("firebrick4","darkred","#C00000","#FF3334",
"#FF6F77","#FFBBC1","#FFDEE3","#FF8896"))
ggplotly(graph_3)
```
Row
-----------------------------------------------------------------------
### Cluster segmentation by clustering
```{r}
set.seed(15)
clus <- ggplot(raw_ifood, aes(x = Age, y = Income, size = Teenhome)) +
geom_point(alpha = 0.25, stat = "identity",
aes(color = as.factor(k2$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2"),
labels=c("Cluster 1", "Cluster 2")) +
ggtitle("Segments of Customers",
subtitle = "Using K-means Clustering") +
scale_y_continuous(labels = scales::comma)
ggplotly(clus)
```
### Customers Segmentation
+---------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Cluster 1 - **Low Value Customers** | Cluster 2 - **High Value Customers** |
+=========================================================================================================+=========================================================================================================+
| - Low or average level of income | - High level of income |
| | |
| - The majority has one kid or teenager at home | - Meat and wine are preferred |
| | |
| - Represents the most part of basic level of education | - The majority has no children |
| | |
| - Low number of purchases through store purchase. They prefer web purchases or make catalog purchases | - Low web visit and high store purchase |
| | |
| - Negative effect of having kids and teens on advertising campaign acceptance | - Number of store purchases decreases when there are kids |
| | |
| | - Selection of wines and fruits, as well as the attractive deals attract customers with higher income |
+---------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
Predictive Model
=======================================================================
Column
-------------------------------------
### Machine Learning Model Random Forest
```{r}
tipo_modelo <- "Bootstrapping"
valueBox(tipo_modelo,
icon = "fa-dice")
```
### Model Accuracy
```{r}
exactitud <- 0.8860759
valueBox(exactitud, icon = "fa-compass")
```
### ROC Curve and AUC
```{r}
roc_auc <- 0.9001666
valueBox(roc_auc, icon = "fa-bar-chart")
```
### Number of trees
```{r}
number_trees <- 500
valueBox(number_trees,
icon = "fa-leaf")
```
### Sample size
```{r}
tamano_mues <- 1660
valueBox(tamano_mues,
icon = "fa-pie-chart")
```
### Model Analysis
- Among the most important variables for the proposed machine learning model, the number of days since the last purchase ('Recency') is very important.
- Because purchasing in store, on the web, or via the catalog ('NumStorePurchases', 'NumWebPurchases', 'NumCatalogPurchases') is positively correlated with 'Income'. Eventually, these variables become significant.
- The history of past campaigns are also crucial factors to this prediction.
- Nevertheless, variables such as 'Complain','Age' of the customer or 'Marital_Status' are not so crucial for the model.
Column
-----------------------------------------------------------------------
### ROC Curve and AUC
```{r}
curva_roc <- image_read('C:/Users/eduar/Downloads/My Things/Data sets R/iFood/curvaroc.jpg')
curva_roc
```
### Feature Importance
```{r}
importancia <- image_read('C:/Users/eduar/Downloads/My Things/Data sets R/iFood/feature_importance.jpg')
importancia
```